RELIABILITY  ANALYSIS  OF  ON-LINE  COMPUTER 
SYSTEMS  WITH  A  SPARE  PROCESSOR  USING  COMPUTER-AIDED  ALGEBRAIC  MANIPULATION 

BY 


RAHUL  CHATTERGY 


APPROVED  FOR  PUBLIC  RELEASE;
DISTRIBUTION  UNLIMITED 

JANUARY  1976  TECHNICAL  REPORT  A76-2 


RELIABILITY  ANALYSIS  OF  ON-LINE  COMPUTER 


SYSTEMS  WITH  A  SPARE  PROCESSOR  USING  COMPUTER-AIDED  ALGEBRAIC  MANIPULATION 


by 


Rahul  Chattergy 
University  of  Hawaii 
Honolulu,  Hawaii


January  1976 


Sponsored  by 

Advanced  Research  Projects  Agency 
ARPA  Order  No.  1956 


***** 

The  views  and  conclusions  contained  in  this  document  are  those  of  the  author 
and  should  not  be  interpreted  as  necessarily  representing  the  official  policies 
either  expressed  or  implied,  of  the  Advanced  Research  Projects  Agency  of  the 
United  States  Government. 


RELIABILITY  ANALYSIS  OF  ON-LINE  COMPUTER
SYSTEMS  WITH  A  SPARE  PROCESSOR  USING  COMPUTER-AIDED  ALGEBRAIC  MANIPULATION


Rahul  Chattergy 
University  of  Hawaii 
Honolulu,  Hawaii 


Abstract 


This paper discusses the reliability of operation of an on-line computer
system with a spare processor, described by a semi-Markov process model.
Analytical solutions are obtained by using computer-aided algebraic manipulation
techniques. The main purpose of the paper is to demonstrate that the
difficulties of obtaining analytic solutions to Markov processes by standard
techniques can be considerably reduced by the application of algebraic symbol
manipulation languages. To the author's knowledge, the results of the
reliability analysis are also new.


Results in this paper were obtained by using MACSYMA, available at the MIT Mathematics
Laboratory, supported by the Advanced Research Projects Agency (ARPA), Department
of Defense, under Office of Naval Research Contract N00014-70-A-0362-0001.

MACSYMA was accessed via the ARPA computer-communication network.

[...] Agency of the Department of Defense and monitored by NASA Ames Research Center

I.  INTRODUCTION

Consider an on-line computer system, such as one used to control newspaper
production. To increase reliability of operation, the main processor is
usually backed up by an identical unit. Without such a spare processor, a
failure of the main processor can cause publication deadlines to be missed,
resulting in revenue loss from advertisements. The reliability of operation can
be enhanced by periodic maintenance of the processors. We assume that the
processors fail only when in operation. When maintenance work is started on
the active processor, the spare processor is put into operation. If the active
processor does not fail until the end of the maintenance work, then the processor
being maintained is set aside as the spare unit. If the active processor
fails, then the processor being maintained or the spare unit is immediately
put into operation. If a processor fails, then repair work is started on it
immediately. If both processors fail, then the first to fail is repaired
first.

This  system  can  be  characterized  by  four  states  listed  in  Table  1. 
Transitions  between  pairs  of  states  occur  at  randomly  distributed  instants  of 
time.  Therefore,  it  is  possible  to  describe  the  system  by  a  semi-Markov 
process.  The  allowable  transitions  between  pairs  of  states  are  shown  in  Fig.  1. 


II.  SEMI-MARKOV  PROCESS  MODEL

Let $\tau_i$ denote the time spent by the process in state $i$ before a
transition to some other state occurs. We define the waiting-time distribution in
state $i$ as

$$W_i(t) = P[\tau_i \le t],$$

and the corresponding density function and mean by $w_i(t)$ and $\bar{w}_i$, respectively.
Let the failure-time, repair-time and maintenance-time distributions be [2,3]

$$P[\text{Failure-time} \le t] = 1 - \exp(-Lt),$$

$$P[\text{Repair-time} \le t] = 1 - \exp(-Gt),$$

and

$$P[\text{Maintenance-time} \le t] = 1 - \exp(-Ht),$$

respectively for $t \ge 0$ and zero otherwise. The maintenance schedule is assumed
to be periodic with period $T_0$ and the following distribution

$$P[\text{Start of Maintenance} \le t] = u(t - T_0),$$

where $u$ is the unit-step function. The waiting-time distribution for each state
can now be computed in a straightforward manner. As an example let us compute
$W_1(t)$. Letting $F$ denote the failure-time, we have

$$P[\tau_1 > t] = P[\min(T_0, F) > t] = P[T_0 > t]\,P[F > t] = [1 - u(t-T_0)]\exp(-Lt).$$

Therefore

$$W_1(t) = 1 - P[\tau_1 > t] = 1 - [1 - u(t-T_0)]\exp(-Lt).$$

Hence,

$$w_1(t) = \frac{dW_1(t)}{dt} = [1 - u(t-T_0)]\,L\exp(-Lt) + \delta(t-T_0)\exp(-Lt)$$

and $\bar{w}_1 = [1 - \exp(-LT_0)]/L$. Figure 2 shows $W_1(t)$ as a function of $t$. The
waiting-time distributions, their density functions and means are listed in Table 2.
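
The mean waiting time in state 1 can be checked numerically. The following is a minimal sketch, not part of the original report: it integrates the survival function $P[\tau_1 > t]$ over $[0, T_0]$ with a midpoint rule and compares the result with the closed form $[1 - \exp(-LT_0)]/L$. The function names and the parameter values $L = 1$, $T_0 = 1/10$ are illustrative assumptions.

```python
import math


def survival_state1(t, L, T0):
    """P[tau_1 > t] = [1 - u(t - T0)] * exp(-L t).

    The process is still in state 1 at time t only if no failure has
    occurred and the scheduled maintenance (at T0) has not yet started.
    """
    return math.exp(-L * t) if t < T0 else 0.0


def mean_wait_numeric(L, T0, steps=100_000):
    """Mean waiting time as the integral of the survival function (midpoint rule)."""
    dt = T0 / steps
    return sum(survival_state1((k + 0.5) * dt, L, T0) for k in range(steps)) * dt


def mean_wait_closed_form(L, T0):
    """Closed form quoted in the text: [1 - exp(-L*T0)] / L."""
    return (1.0 - math.exp(-L * T0)) / L


if __name__ == "__main__":
    L, T0 = 1.0, 0.1  # illustrative values only (assumption)
    print(mean_wait_numeric(L, T0), mean_wait_closed_form(L, T0))
```

The two numbers should agree to many decimal places; the impulse $\delta(t-T_0)\exp(-Lt)$ in the density is what makes the survival function drop to zero at $T_0$.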

Table 1

STATE DESCRIPTION TABLE

Label   State

1       One active processor with a spare
2       Both processors down
3       One active processor with the other in maintenance
4       One active processor with no spare


Table 2

WAITING-TIME DISTRIBUTIONS

State   Distribution              Density                                     Mean

1       1-[1-u(t-T0)]exp(-Lt)     [1-u(t-T0)]L exp(-Lt) + δ(t-T0)exp(-Lt)     [1-exp(-LT0)]/L
2       1-exp(-Gt)                G exp(-Gt)                                  1/G
3       1-exp(-(L+H)t)            (L+H)exp(-(L+H)t)                           1/(L+H)
4       1-exp(-(L+G)t)            (L+G)exp(-(L+G)t)                           1/(L+G)


Let $p_{ij}(t)$ denote the conditional probability density function of a transition
to state $j$ in $[t, t+\Delta]$ given that the process entered state $i$ at time zero and
the next transition from state $i$ occurs in $[t, t+\Delta]$ for sufficiently small $\Delta > 0$.
Then the probability density function of a transition from state $i$ to state $j$
after waiting $t$ units of time in state $i$ is given by

$$c_{ij}(t) = p_{ij}(t)\,w_i(t).$$

The core matrix of a semi-Markov process is defined as $C(t) = [c_{ij}(t)]$ and it
provides a complete probabilistic description of the process [1]. The core
matrix of this process is shown in Table 3.

Let $e_{ij}(t)\Delta$ denote the probability that the process will enter state $j$
in $[t, t+\Delta]$ given that it entered state $i$ at time zero and let the entry matrix
be $E(t) = [e_{ij}(t)]$. Then

$$E(s) = [I - C(s)]^{-1},$$

where $I$ is the identity matrix and $E(s)$ and $C(s)$ are Laplace transforms of $E(t)$
and $C(t)$ respectively (see [1]). The matrix $C(s)$ is shown in Table 4. Define

$$E = \lim_{s \to 0}\,[sE(s)].$$

For a monodesmic process, such as the one discussed here, the rows of $E$ are
identical [1]. Let $e_j$ denote the $j$-th element of any row of $E$. Then the limiting
interval transition probability for state $j$, denoted by $h_j$, is given by

$$h_j = e_j \bar{w}_j.$$

Suppose the process has been operating unobserved for a long period of time.
Then $h_j$ is the probability of the event that the process will be in state $j$
when observed next. The state occupancy statistics can also be obtained from
$E(s)$. The details of this and mean first-passage time computations can be
found in [1].
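
As a check on the first row of Table 4, the two nonzero entries in row 1 of Table 3 can be transformed by direct integration. The worked step below is a sketch added for illustration; it uses only the definitions of $u$, $\delta$ and the Laplace transform.

```latex
\begin{aligned}
c_{13}(s) &= \int_0^\infty \delta(t-T_0)\,e^{-Lt}\,e^{-st}\,dt = e^{-(s+L)T_0},\\[4pt]
c_{14}(s) &= \int_0^{T_0} L\,e^{-(s+L)t}\,dt = \frac{L\,[\,1-e^{-(s+L)T_0}\,]}{s+L}.
\end{aligned}
```

At $s = 0$ these reduce to $e^{-LT_0}$ and $1 - e^{-LT_0}$, which sum to one, as the transition probabilities out of state 1 must.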

Table 3

CORE MATRIX

        1                  2                  3                    4

1       0                  0                  δ(t-T0)exp(-Lt)      [u(t)-u(t-T0)]L exp(-Lt)
2       0                  0                  0                    G exp(-Gt)
3       H exp(-(L+H)t)     0                  0                    L exp(-(L+H)t)
4       G exp(-(L+G)t)     L exp(-(L+G)t)     0                    0


Table 4

LAPLACE TRANSFORM OF CORE MATRIX

        1                  2                  3                    4

1       0                  0                  exp(-(S+L)T0)        L(1-exp(-(S+L)T0))/(S+L)
2       0                  0                  0                    G/(S+G)
3       H/(S+L+H)          0                  0                    L/(S+L+H)
4       G/(S+L+G)          L/(S+L+G)          0                    0


The next section shows how analytic expressions can be obtained for $h_j$,
state occupancy statistics, and mean first-passage times by using computer-aided
algebraic manipulation techniques. This material was generated interactively
by using the symbol manipulation language MACSYMA, available at the MIT
Mathematics Laboratory and accessed via the ARPA computer communications network.


III.  ANALYTIC  SOLUTIONS  VIA  MACSYMA 

The  original  computer  outputs  did  not  have  any  comments.  The  comments 
imbedded  between  the  pairs  of  characters  "/*"  and  "*/"  are  added  to  explain 
the  procedure. 


/* TIME:TRUE PRINTS CPU TIME USED IN EACH STEP IN MILLISECONDS: */

(C1) TIME:TRUE $

TIME= 8 MSEC.

/* WAIT=ROW VECTOR OF MEAN WAITING TIMES: */

(C2) WAIT:MATRIX([(1-%E^(-L*TO))/L,1/G,1/(L+H),1/(L+G)]);

TIME= [illegible] MSEC.

      [       - L TO                      ]
      [ 1 - %E          1     1       1   ]
(D2)  [ ------------    -   -----   ----- ]
      [      L          G   L + H   L + G ]


/* SCORE=LAPLACE TRANSFORM OF THE CORE MATRIX = C(S). */

/* SCORE IS ENTERED ROW BY ROW, EACH ROW ENCLOSED IN [ ]. */

/* %E DENOTES THE EXPONENTIAL "e" AND ^ DENOTES EXPONENTIATION: */

(C3) SCORE:MATRIX
([0,0,%E^(-(S+L)*TO),(L/(S+L))*(1-%E^(-(S+L)*TO))],
[0,0,0,G/(S+G)],
[H/(S+L+H),0,0,L/(S+L+H)],
[G/(S+L+G),L/(S+L+G),0,0]);

TIME= 244 MSEC.

      [                                                   - (S + L) TO  ]
      [                           - (S + L) TO   L (1 - %E            ) ]
      [     0           0       %E               ---------------------- ]
      [                                                  S + L          ]
      [                                                                 ]
      [                                                    G            ]
      [     0           0             0                  -----          ]
(D3)  [                                                  S + G          ]
      [                                                                 ]
      [     H                                              L            ]
      [ ---------       0             0                ---------        ]
      [ S + L + H                                      S + L + H        ]
      [                                                                 ]
      [     G           L                                               ]
      [ ---------   ---------         0                    0            ]
      [ S + L + G   S + L + G                                           ]


/* SENTRY=E(S)=INVERSE OF (IDENTITY-SCORE). ^^ DENOTES NONCOMMUTATIVE */
/* EXPONENTIATION. INVERSE OF MATRIX=MATRIX^^-1: */

(C4) SENTRY:(IDENT(4)-SCORE)^^-1 $

TIME= 56426 MSEC.

/* TEMPO=S*FIRST ROW OF SENTRY. RATSIMP IS AN OPERATOR USED FOR SIMPLIFICATION: */

(C5) TEMPO:RATSIMP(S*ROW(SENTRY,1)) $

TIME= 128761 MSEC.

/* LMT=LIMIT OF TEMPO WHEN S-->0: */

(C6) LMT:LIMIT(TEMPO,S,0) $

LIMIT FASL DSK MACSYM BEING LOADED
LOADING DONE
TIME= 2251 MSEC.

/* LMTDST=ROW VECTOR OF STEADY STATE PROBABILITY DISTRIBUTION: */

(C7) LMTDST:RATSIMP(LMT*WAIT) $

TIME= 31529 MSEC.

(C8) LMTDST[1,1];

TIME= 6 MSEC.


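                         2      2      L TO    2      2
                       (G  L + G  H) %E     - G  L - G  H
(D8)  --------------------------------------------------------------------
        3            2            2       2      L TO      2            2
      (L  + (H + G) L  + (G H + G ) L + G  H) %E     - H L  - G H L - G  H

(C9) LMTDST[1,2];

TIME= 5 MSEC.

            2
           L
(D9)  -------------
       2          2
      L  + G L + G

(C10) LMTDST[1,3];

TIME= 5 MSEC.

                                        2
                                       G  L
(D10) --------------------------------------------------------------------
        3            2            2       2      L TO      2            2
      (L  + (H + G) L  + (G H + G ) L + G  H) %E     - H L  - G H L - G  H

(C11) LMTDST[1,4];

TIME= 5 MSEC.

           G L
(D11) -------------
       2          2
      L  + G L + G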

/* CHECK ON THE SUM OF THE PROBABILITIES OF LMTDST: */

(C12) RATSIMP(LMTDST.TRANSPOSE([1,1,1,1]));

TIME= 6407 MSEC.

(D12) 1

/* TOCUP[1,2]=MEAN NO. OF TIMES STATE 2 IS VISITED IN [0,T] STARTING IN */
/* STATE 1 AT TIME ZERO. SOCUP[1,2]=LAPLACE TRANSFORM OF TOCUP[1,2]: */

(C19) SOCUP[1,2]:RATSIMP(SENTRY[1,2]/S) $

TIME= 366 MSEC.

/* ILT OPERATOR COMPUTES THE INVERSE LAPLACE TRANSFORM: */

(C20) TOCUP[1,2]:ILT(SOCUP[1,2],S,T) $

LAPLAC FASL DSK MACSYM BEING LOADED
LOADING DONE

/* INFORMATION REQUESTED BY THE LAPLACE TRANSFORM ROUTINE: */

IS G L POSITIVE, NEGATIVE OR ZERO?

/* ANSWER ENTERED FROM THE TERMINAL: */

POSITIVE;

TIME= 10625 MSEC.

/* COMPUTE TOCUP[1,2] FOR L=1, G=10, TO=1/10 AND H=10: */

(C21) %,L=1,G=10,TO=1/10,H=10;

TIME= 910 MSEC.

        - 11 T  89 SINH(SQRT(10) T)   109 COSH(SQRT(10) T)    10 T    109
(D21) %E       (------------------- + --------------------) + ---- - -----
                  12321 SQRT(10)             12321            111    12321

/* TOCUP[1,2] HAS A LINEAR TERM IN T WHICH WILL BE DOMINANT FOR LARGE */
/* VALUES OF T. PART FUNCTION IS USED TO SELECT ANY PART OF AN */
/* EXPRESSION. THE LINEAR TERM IN T IS: */

(C22) PART(TOCUP[1,2],2);

TIME= 76 MSEC.

             2
          G L  T
(D22) -------------
       2          2
      L  + G L + G
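
Expression (D21) can be cross-checked without symbolic algebra. The sketch below is an independent numerical check, not the author's procedure. It rests on one assumption made here for illustration: since the failure rate is $L$ in both state 1 and state 3, and a failure in either leads to state 4, states 1 and 3 can be lumped into a single state A when counting entries into state 2. The lumped chain A, 4, 2 is then an ordinary Markov chain, and the mean number of entries into state 2 is the time integral of $L \cdot P[\text{state 4}]$. All function names are hypothetical.

```python
import math


def tocup_closed_form(T):
    """Expression (D21): mean number of visits to state 2 in [0, T] for L=1, G=10."""
    r = math.sqrt(10.0)
    transient = math.exp(-11.0 * T) * (
        89.0 * math.sinh(r * T) / (12321.0 * r)
        + 109.0 * math.cosh(r * T) / 12321.0
    )
    return transient + 10.0 * T / 111.0 - 109.0 / 12321.0


def tocup_numeric(T, L=1.0, G=10.0, steps=20_000):
    """Integrate the lumped three-state chain with classical fourth-order Runge-Kutta.

    y = (P_A, P_4, P_2, N): A lumps states 1 and 3 (assumption stated above);
    N accumulates the entry rate into state 2, which is L * P_4.
    """
    def deriv(y):
        pa, p4, p2, _ = y
        return (
            -L * pa + G * p4,
            L * pa - (L + G) * p4 + G * p2,
            L * p4 - G * p2,
            L * p4,
        )

    def shifted(y, k, a):
        return tuple(yi + a * ki for yi, ki in zip(y, k))

    h = T / steps
    y = (1.0, 0.0, 0.0, 0.0)  # start in state 1, i.e. in lumped state A
    for _ in range(steps):
        k1 = deriv(y)
        k2 = deriv(shifted(y, k1, h / 2))
        k3 = deriv(shifted(y, k2, h / 2))
        k4 = deriv(shifted(y, k3, h))
        y = tuple(
            yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
        )
    return y[3]


if __name__ == "__main__":
    for T in (0.5, 1.0, 2.0):
        print(T, tocup_closed_form(T), tocup_numeric(T))
```

The absence of TO and H from (D21) and (D22) is consistent with the lumping assumption. For very large T the closed form should be evaluated with care, because `math.sinh` and `math.cosh` overflow before the factor `exp(-11 T)` can compensate.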


IV.  DISCUSSION  OF  THE  RESULTS 


Two important parameters for estimating the reliability of operation of
this system are $h_1$ and $h_2$, respectively the steady-state probabilities of being
operational with a spare unit and of being completely shut down. Let
$D = (L+H)(L^2+GL+G^2)$. Then

$$h_1 = \frac{G^2\,(L+H)\,[\exp(LT_0) - 1]}{D\exp(LT_0) - H(L^2+GL+G^2)}, \tag{1}$$

$$h_2 = \frac{L^2}{L^2+GL+G^2}. \tag{2}$$

Also

$$h_3 = \frac{G^2 L}{D\exp(LT_0) - H(L^2+GL+G^2)}, \tag{3}$$

and

$$h_4 = \frac{GL}{L^2+GL+G^2}. \tag{4}$$
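
Equations (1) to (4) can also be checked numerically by a different route from the one used in the MACSYMA session. The sketch below is an illustration under stated assumptions, not the author's method: it uses the standard semi-Markov result that $h_j$ is proportional to $\pi_j \bar{w}_j$, where $\pi$ is the stationary distribution of the embedded chain with transition probabilities $C(0)$ (Table 4 at $s = 0$). The function names and the parameter values are hypothetical.

```python
import math


def closed_form_h(L, G, H, T0):
    """Equations (1)-(4) for the limiting probabilities h_1..h_4."""
    q = L * L + G * L + G * G
    d = (L + H) * q
    denom = d * math.exp(L * T0) - H * q
    h1 = G * G * (L + H) * (math.exp(L * T0) - 1.0) / denom
    h2 = L * L / q
    h3 = G * G * L / denom
    h4 = G * L / q
    return [h1, h2, h3, h4]


def semi_markov_h(L, G, H, T0, sweeps=5000):
    """h_j proportional to pi_j * (mean wait in j), pi from the embedded chain."""
    a = math.exp(-L * T0)  # probability that maintenance starts before a failure
    # Embedded transition probabilities: Table 4 evaluated at s = 0.
    P = [
        [0.0, 0.0, a, 1.0 - a],
        [0.0, 0.0, 0.0, 1.0],
        [H / (L + H), 0.0, 0.0, L / (L + H)],
        [G / (L + G), L / (L + G), 0.0, 0.0],
    ]
    waits = [(1.0 - a) / L, 1.0 / G, 1.0 / (L + H), 1.0 / (L + G)]  # Table 2 means
    pi = [0.25] * 4
    for _ in range(sweeps):
        stepped = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
        # Averaging with the previous iterate (a "lazy" chain) has the same
        # stationary distribution and damps oscillation between iterates.
        pi = [0.5 * (x + y) for x, y in zip(pi, stepped)]
    weighted = [p * w for p, w in zip(pi, waits)]
    total = sum(weighted)
    return [x / total for x in weighted]


if __name__ == "__main__":
    args = (1.0, 10.0, 10.0, 0.1)  # L, G, H, T0: illustrative values (assumption)
    print(closed_form_h(*args))
    print(semi_markov_h(*args))
```

Power iteration is used only to keep the sketch free of external libraries; any linear solver for the stationary equations would serve equally well.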

During the interactive session, we verified that $h_1+h_2+h_3+h_4=1$, as it should be.
As a further check on our results we consider the system without any preventive
maintenance work. By letting $T_0$ go to infinity we eliminate all maintenance
work in the future except that starting at time zero. The corresponding state
transition diagram is shown in Fig. 3. In this case we have

$$h_1 = G^2(L+H)/D,$$
$$h_2 = L^2/(L^2+GL+G^2),$$
$$h_3 = 0,$$

and

$$h_4 = GL/(L^2+GL+G^2).$$

Now letting $H$ go to infinity we eliminate state 3 from our process and obtain

$$h_1 = G^2/(L^2+GL+G^2), \tag{5}$$

$$h_2 = L^2/(L^2+GL+G^2), \tag{6}$$

$$h_3 = 0, \tag{7}$$

and

$$h_4 = GL/(L^2+GL+G^2). \tag{8}$$

The steady state distribution for the resulting three state process can be
computed by hand, and it agrees with equations (5), (6) and (8).

Now let us consider the effect of maintenance on $h_1$. We assume that when
maintenance is done $H$ is a constant and the reciprocal of the average failure-time,
$L$, is a function of $T_0$, i.e., $L = F(T_0)$. $F$ is assumed to be a nondecreasing
function with $F(0) = L_0 \ge 0$ and $F(\infty) = L_1 < \infty$. Without maintenance, $h_1^*$ is
computed from Equation (5) to be

$$h_1^* = G^2/(L_1^2 + GL_1 + G^2). \tag{9}$$

For a fair comparison, $h_1^*$ should be compared with $h_1 + h_3$ when maintenance is
present. Using equations (1) and (3) we have

$$h_1 + h_3 = G^2/(F(T_0)^2 + G\,F(T_0) + G^2). \tag{10}$$

Since $F$ is assumed to be a nondecreasing function, $F(T_0) \le F(\infty) = L_1$ and hence from
equation (9) and equation (10) we have

$$h_1 + h_3 \ge h_1^*,$$

i.e., maintenance increases the probability of the system being operational
in the steady state.
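
The step from equations (1) and (3) to equation (10) is a short cancellation. The worked derivation below is a sketch supplied for illustration; it uses only $D = (L+H)(L^2+GL+G^2)$ and $L = F(T_0)$.

```latex
\begin{aligned}
h_1 + h_3
  &= \frac{G^2\,\bigl[(L+H)(e^{LT_0}-1) + L\bigr]}{D\,e^{LT_0} - H(L^2+GL+G^2)}
   = \frac{G^2\,\bigl[(L+H)\,e^{LT_0} - H\bigr]}{(L^2+GL+G^2)\,\bigl[(L+H)\,e^{LT_0} - H\bigr]} \\[4pt]
  &= \frac{G^2}{L^2+GL+G^2}, \qquad L = F(T_0).
\end{aligned}
```

Because $x \mapsto G^2/(x^2+Gx+G^2)$ is decreasing for $x \ge 0$, the bound $F(T_0) \le L_1$ gives the stated inequality between (10) and (9).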

Next let us consider a state-occupancy statistic of relevance to the
reliability of the system. Let $N_{12}(T)$ denote the average number of times
the process visits state 2 in the time interval $[0,T]$, starting in state 1
at time zero. Then for large values of $T$

$$N_{12}(T) \approx GL^2T/(L^2+GL+G^2).$$

Let

$$Q(L) = GL^2/(L^2+GL+G^2) = Gh_2. \tag{11}$$

Then it is easy to verify that $Q(0)=0$, $Q(\infty)=G$, $dQ/dL>0$ for all $G,L>0$,

$$\frac{d^2Q}{dL^2} > 0 \quad \text{for } G^3 > L^2(L+3G),$$

and

$$\frac{d^2Q}{dL^2} < 0 \quad \text{for } G^3 < L^2(L+3G).$$

We can sketch $Q(L)$ as a function of $L$, which is shown in Fig. 4. To minimize
$N_{12}(T)$ we have to minimize $Q(L)$ and this can be done by reducing $T_0$.
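
The stated properties of $Q(L)$ can be checked against explicit derivatives. The following sketch is an illustration, not taken from the report: the closed forms for $dQ/dL$ and $d^2Q/dL^2$ were obtained here by the quotient rule and are compared with central differences in exact rational arithmetic. Function names are hypothetical.

```python
from fractions import Fraction


def q_rate(L, G):
    """Q(L) = G L^2 / (L^2 + G L + G^2), equation (11)."""
    L, G = Fraction(L), Fraction(G)
    return G * L**2 / (L**2 + G * L + G**2)


def dq(L, G):
    """First derivative by the quotient rule: G^2 L (L + 2G) / (L^2 + GL + G^2)^2."""
    L, G = Fraction(L), Fraction(G)
    return G**2 * L * (L + 2 * G) / (L**2 + G * L + G**2) ** 2


def d2q(L, G):
    """Second derivative: 2 G^2 [G^3 - L^2 (L + 3G)] / (L^2 + GL + G^2)^3."""
    L, G = Fraction(L), Fraction(G)
    return 2 * G**2 * (G**3 - L**2 * (L + 3 * G)) / (L**2 + G * L + G**2) ** 3


def central_differences(L, G, h=Fraction(1, 10**6)):
    """Finite-difference estimates of the first and second derivatives of Q."""
    L = Fraction(L)
    first = (q_rate(L + h, G) - q_rate(L - h, G)) / (2 * h)
    second = (q_rate(L + h, G) - 2 * q_rate(L, G) + q_rate(L - h, G)) / h**2
    return first, second


if __name__ == "__main__":
    for L, G in [(1, 10), (10, 1), (2, 3)]:
        first, second = central_differences(L, G)
        print(L, G, float(dq(L, G)), float(first), float(d2q(L, G)), float(second))
```

The sign of the second derivative is carried entirely by the factor $G^3 - L^2(L+3G)$, which matches the two conditions quoted above.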

Next let us consider the average first-passage times between pairs of
states. Let $T_{ij}$ denote the average first-passage time from state $i$ to state
$j$. Then we define $T_{21}$ as the recovery-time of the system from complete breakdown
in state 2 to fully operational in state 1. The unknown quantities $T_{ij}$ satisfy
a set of linear algebraic equations [1] which can be easily solved by MACSYMA.


$T_{21}$ is linearly related to $L$ and can be reduced by decreasing $L$. From
equations (11) and (12) we conclude that reducing $T_0$ will reduce $N_{12}(T)$ and
$T_{21}$ by decreasing $L$.
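
The expression for $T_{21}$ is not legible in this copy, so the sketch below is offered only as an illustration of how such first-passage equations can be set up and solved. It is not the author's equation (12). It assumes the usual relation $T_{i1} = \bar{w}_i + \sum_{k \ne 1} p_{ik} T_{k1}$, with the mean waiting times of Table 2 and the transition probabilities $p_{ik} = c_{ik}(0)$ from Table 4. Under these assumptions the result is $T_{21} = (L + 2G)/G^2$, which is linear in $L$, in agreement with the remark above. The function name is hypothetical.

```python
from fractions import Fraction


def mean_first_passage_to_state1(L, G, H):
    """Solve T_i1 = wbar_i + sum_{k != 1} p_ik * T_k1 for i = 2, 3, 4.

    From Table 4 at s = 0:
        state 2 -> 4 with probability 1
        state 3 -> 1 with probability H/(L+H), -> 4 with probability L/(L+H)
        state 4 -> 1 with probability G/(L+G), -> 2 with probability L/(L+G)
    """
    L, G, H = Fraction(L), Fraction(G), Fraction(H)
    # T21 = 1/G + T41 and T41 = 1/(L+G) + (L/(L+G)) * T21; eliminate T41:
    t21 = (1 / G + 1 / (L + G)) / (1 - L / (L + G))
    t41 = 1 / (L + G) + L / (L + G) * t21
    t31 = 1 / (L + H) + L / (L + H) * t41
    return {2: t21, 3: t31, 4: t41}


if __name__ == "__main__":
    times = mean_first_passage_to_state1(1, 10, 10)  # illustrative values (assumption)
    print(times[2], times[3], times[4])
```

Exact rational arithmetic is used so that the result can be compared term by term with a symbolic expression.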

V.  CONCLUSIONS 

The same remarks made in [4] are also valid here.